Superscalar GEMM-based Level 3 BLAS - The On-going Evolution of a Portable and High-Performance Library
Abstract
Recently, a first version of our GEMM-based level 3 BLAS for superscalar-type processors was announced. A new feature is the inclusion of DGEMM itself. This DGEMM routine contains inline what we call a level 3 kernel routine, which is based on register blocking. Additionally, it features level 1 cache blocking and data copying of sub-matrix operands for the level 3 kernel. Our other BLAS's that possess triangular operands, e.g., DTRSM and DSYRK, use a similar level 3 kernel routine to handle the triangular blocks that appear on the diagonal of the larger triangular input operand. As in our previous GEMM-based work, all other BLAS's perform the dominating part of the computations in calls to DGEMM. We are seeing the adoption of our BLAS's by several organizations, including the ATLAS and PHiPAC projects on automatic generation of fast DGEMM kernels for superscalar processors, and some computer vendors. The evolution of the superscalar GEMM-based level 3 BLAS is presented. We also describe new developments, including techniques that make the library applicable to symmetric multiprocessing (SMP) systems.
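To make the blocking strategy concrete, here is a minimal C sketch of the general technique the abstract describes: sub-matrix operands are copied (packed) into contiguous buffers sized for the level 1 cache, and an inner level 3 kernel holds a small block of C in registers while sweeping the packed panels. All names and block sizes are illustrative assumptions, not the library's actual tuned parameters, and edge cases are omitted for brevity.

```c
#include <stdlib.h>

/* Hypothetical blocking parameters; real values are tuned per
   processor so the packed panels stay resident in the L1 cache. */
#define MC 64   /* rows of A packed per outer iteration  */
#define KC 128  /* shared dimension of the packed panels */
#define MR 4    /* register-block rows of C              */
#define NR 4    /* register-block columns of C           */

/* Level 3 kernel: an MR x NR block of C is accumulated in local
   scalars the compiler can keep in registers. */
static void kernel_4x4(int kc, const double *a, const double *b,
                       double *c, int ldc)
{
    double acc[MR][NR] = {{0.0}};
    for (int p = 0; p < kc; ++p)
        for (int i = 0; i < MR; ++i)
            for (int j = 0; j < NR; ++j)
                acc[i][j] += a[p*MR + i] * b[p*NR + j];
    for (int j = 0; j < NR; ++j)
        for (int i = 0; i < MR; ++i)
            c[j*ldc + i] += acc[i][j];
}

/* Data copying: an MC x KC block of A into contiguous MR-row panels. */
static void pack_A(int mc, int kc, const double *A, int lda, double *buf)
{
    for (int i = 0; i < mc; i += MR)
        for (int p = 0; p < kc; ++p)
            for (int ii = 0; ii < MR; ++ii)
                *buf++ = A[(i + ii) + p*lda];
}

/* Data copying: a KC x n panel of B into contiguous NR-column panels. */
static void pack_B(int kc, int n, const double *B, int ldb, double *buf)
{
    for (int j = 0; j < n; j += NR)
        for (int p = 0; p < kc; ++p)
            for (int jj = 0; jj < NR; ++jj)
                *buf++ = B[p + (j + jj)*ldb];
}

/* C := C + A*B, column-major; all dimensions are assumed to be
   multiples of the block sizes to keep the sketch short. */
void dgemm_sketch(int m, int n, int k,
                  const double *A, int lda,
                  const double *B, int ldb,
                  double *C, int ldc)
{
    double *Ab = malloc(MC * KC * sizeof *Ab);
    double *Bb = malloc(KC * n  * sizeof *Bb);
    for (int pc = 0; pc < k; pc += KC) {
        pack_B(KC, n, &B[pc], ldb, Bb);
        for (int ic = 0; ic < m; ic += MC) {
            pack_A(MC, KC, &A[ic + pc*lda], lda, Ab);
            for (int jr = 0; jr < n; jr += NR)
                for (int ir = 0; ir < MC; ir += MR)
                    kernel_4x4(KC, &Ab[ir*KC], &Bb[jr*KC],
                               &C[(ic + ir) + jr*ldc], ldc);
        }
    }
    free(Ab); free(Bb);
}
```

The copying adds O(mk + kn) work per pass, but it makes every access in the kernel unit-stride and cache-resident, which is what allows the register-blocked inner loops to run near peak.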
Similar Resources
Parallel Triangular Sylvester-Type Matrix Equation Solvers for SMP Systems Using Recursive Blocking
We present recursive blocked algorithms for solving triangular Sylvester-type matrix equations. Recursion leads to automatic blocking that is variable and "squarish". The main part of the computations is performed as level 3 general matrix multiply and add (GEMM) operations. We also present new highly optimized superscalar kernels for solving small-sized matrix equations stored in level 1 cache...
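As a rough illustration of the recursive blocking idea (not the paper's actual algorithm), the C sketch below solves the triangular Sylvester equation AX + XB = C, with A and B upper triangular, by halving the larger dimension; each split yields one GEMM update, and recursion continues to a scalar base case where a tuned solver would instead switch to a level 1 cache kernel.

```c
/* C := C + alpha*A*B (a stand-in for a call to DGEMM). */
static void gemm_update(int m, int n, int k, double alpha,
                        const double *A, int lda,
                        const double *B, int ldb,
                        double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                C[i + j*ldc] += alpha * A[i + p*lda] * B[p + j*ldb];
}

/* Solve A*X + X*B = C for X (overwriting C); A is m x m upper
   triangular, B is n x n upper triangular, column-major storage.
   Assumes A[i][i] + B[j][j] != 0 so each subproblem is solvable. */
void trsyl_rec(int m, int n,
               const double *A, int lda,
               const double *B, int ldb,
               double *C, int ldc)
{
    if (m == 1 && n == 1) {            /* scalar base case */
        C[0] /= (A[0] + B[0]);
        return;
    }
    if (m >= n) {                      /* split A = [A11 A12; 0 A22] */
        int h = m / 2;
        /* bottom block row X2 depends on nothing else: solve first */
        trsyl_rec(m - h, n, &A[h + h*lda], lda, B, ldb, &C[h], ldc);
        /* C1 := C1 - A12*X2, then solve for X1 */
        gemm_update(h, n, m - h, -1.0, &A[h*lda], lda, &C[h], ldc, C, ldc);
        trsyl_rec(h, n, A, lda, B, ldb, C, ldc);
    } else {                           /* split B = [B11 B12; 0 B22] */
        int h = n / 2;
        /* left block columns X1 first */
        trsyl_rec(m, h, A, lda, B, ldb, C, ldc);
        /* C2 := C2 - X1*B12, then solve for X2 */
        gemm_update(m, n - h, h, -1.0, C, ldc, &B[h*ldb], ldb,
                    &C[h*ldc], ldc);
        trsyl_rec(m, n - h, A, lda, &B[h + h*ldb], ldb, &C[h*ldc], ldc);
    }
}
```

Because the split always halves the larger dimension, the generated blocks stay roughly square, which matches the "squarish" automatic blocking the abstract mentions.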
From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming
In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While the Khronos group developed OpenCL with programming portability in mind, performance is not necessarily portable. OpenCL has required performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide ...
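For context, the host-side setup below is a minimal sketch of the explicit, one-time OpenCL initializations the abstract alludes to; CUDA creates its context implicitly on the first runtime call and compiles kernels offline, so none of these steps sit on its critical path. Error handling is omitted to keep the sketch short.

```c
#include <CL/cl.h>

/* One-time OpenCL setup: every step is explicit, and clBuildProgram
   compiles the kernel source at run time, which is part of the
   initialization cost the abstract refers to. */
cl_command_queue setup(const char *kernel_src, cl_context *ctx_out,
                       cl_program *prog_out)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src,
                                                NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL); /* JIT compile */

    *ctx_out = ctx;
    *prog_out = prog;
    return q;
}
```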
متن کاملAutomating the Last-Mile for High Performance Dense Linear Algebra
High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. The real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Achieving high performance ...
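One standard way to build a complex GEMM from real-valued kernels, illustrating the abstract's point, is the so-called 4m formulation: writing A = Ar + i*Ai (and similarly for B and C), the update C := C + A*B decomposes into Cr += Ar*Br - Ai*Bi and Ci += Ar*Bi + Ai*Br, i.e., four real GEMMs. The C sketch below assumes split (planar) storage of real and imaginary parts purely for illustration; it is not the paper's formulation, and real ZGEMM interfaces use interleaved complex storage.

```c
/* real_gemm is assumed to compute C := C + alpha*A*B on
   column-major double arrays. */
typedef void (*real_gemm_t)(int m, int n, int k, double alpha,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc);

/* "4m"-style complex multiply via four real GEMM calls. */
void zgemm_via_4m(real_gemm_t real_gemm, int m, int n, int k,
                  const double *Ar, const double *Ai, int lda,
                  const double *Br, const double *Bi, int ldb,
                  double *Cr, double *Ci, int ldc)
{
    real_gemm(m, n, k,  1.0, Ar, lda, Br, ldb, Cr, ldc); /* Cr += Ar*Br */
    real_gemm(m, n, k, -1.0, Ai, lda, Bi, ldb, Cr, ldc); /* Cr -= Ai*Bi */
    real_gemm(m, n, k,  1.0, Ar, lda, Bi, ldb, Ci, ldc); /* Ci += Ar*Bi */
    real_gemm(m, n, k,  1.0, Ai, lda, Br, ldb, Ci, ldc); /* Ci += Ai*Br */
}
```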
Algorithm Xyz. GEMM-based Level 3 BLAS: Installation, Tuning and Use of the Model Implementations and the Performance Evaluation Benchmark
The GEMM-based level 3 BLAS model implementations, which are structured to effectively reduce data traffic in a memory hierarchy, and the performance evaluation benchmark, which is a tool for evaluating and comparing different implementations of the level 3 BLAS with the GEMM-based model implementations, are presented in [5]. Here, the installation and tuning of the Fortran 77 model implementations, ...
MAGMA Batched: A Batched BLAS Approach for Small Matrix Factorizations and Applications on GPUs
A particularly challenging class of problems arising in many applications, called batched problems, involves linear algebra operations on many small-sized matrices. We proposed and designed batched BLAS (Basic Linear Algebra Subroutines), Level-2 GEMV and Level-3 GEMM, to solve them. We illustrate how to optimize batched GEMV and GEMM to assist batched advanced factorizations (e.g., bi-diagonaliza...
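A minimal CPU sketch of a pointer-array batched GEMM interface of the kind the abstract describes is shown below; the interface is illustrative, not MAGMA's actual API. A GPU implementation would launch a single kernel that processes all matrices in the batch concurrently rather than loop on the host.

```c
/* Stand-in for an optimized small-matrix kernel; assumes tightly
   packed column-major storage (lda = m, ldb = k, ldc = m). */
static void small_gemm(int m, int n, int k,
                       const double *A, const double *B, double *C)
{
    for (int j = 0; j < n; ++j)
        for (int p = 0; p < k; ++p)
            for (int i = 0; i < m; ++i)
                C[i + j*m] += A[i + p*m] * B[p + j*k];
}

/* Pointer-array batched GEMM: C[b] := C[b] + A[b]*B[b] for
   batch_count independent problems of identical size. */
void dgemm_batched(int m, int n, int k,
                   const double * const *A, const double * const *B,
                   double * const *C, int batch_count)
{
    for (int b = 0; b < batch_count; ++b)
        small_gemm(m, n, k, A[b], B[b], C[b]);
}
```

Grouping many small independent GEMMs behind one call is what amortizes per-call and per-launch overhead, which dominates when each matrix is tiny.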
Year of publication: 1998